Abstract

In conventional methods, distributed support vector machine (SVM)
algorithms are trained over pre-configured intranet/internet environments
to find an optimal classifier. These methods are complicated and costly
for large datasets. Hence, we propose a method, referred to as the Cloud
SVM training mechanism (CloudSVM), that uses the MapReduce technique in a
cloud computing environment for distributed machine learning
applications. Accordingly, (i) the SVM algorithm is trained on distributed
cloud storage servers that work concurrently; (ii) all support vectors
from every trained cloud node are merged; and (iii) these two steps are
iterated until the SVM converges to the optimal classifier function.
Large scale data sets cannot be trained with the SVM algorithm on a single
computer. The results of this study are important for training large
scale data sets for machine learning applications. We show that iterative
training of a split data set with SVM in a cloud computing environment
converges to a global optimal classifier in a finite number of iterations.



1 Introduction 

Machine learning applications generally require large amounts of computation
time and storage space. Learning algorithms have to be scaled up to handle
extremely large data sets. When the training set is large, not all the examples
can be loaded into memory at once during the training phase of the machine
learning algorithm. The computation and memory requirements must then be
distributed among several connected computers.

In the machine learning field, support vector machines (SVMs) offer one of the
most robust and accurate classification methods owing to their generalization
properties. With its solid theoretical foundation and proven effectiveness,
SVM has contributed to researchers' success in many fields. However, SVMs
suffer from a widely recognized scalability problem in both memory requirement
and computational time. Because the SVM algorithm's computation and memory
requirements increase rapidly with the number of instances in the data set,
many data sets are not suitable for classification [14]. The SVM algorithm is
formulated as a quadratic optimization problem, which has $O(m^3)$ time and
$O(m^2)$ space complexity, where m is the training set size [5]. In practice,
the computation time of SVM training is at least quadratic in the number of
training instances.
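
To make the quadratic space requirement concrete, the short sketch below
estimates the memory needed to hold a dense m x m kernel matrix. The function
name and the 8 bytes per entry (double precision) are illustrative assumptions,
not figures taken from this paper.

```python
def kernel_matrix_bytes(m, bytes_per_entry=8):
    """Memory for a dense m x m kernel matrix (assumes 8-byte doubles)."""
    return m * m * bytes_per_entry


# Print the estimate for a few hypothetical training set sizes.
for m in (1_000, 100_000, 1_000_000):
    gib = kernel_matrix_bytes(m) / 2**30
    print(f"m = {m:>9,d}: {gib:,.2f} GiB")
```

Under these assumptions, 100,000 training instances already require
8 x 10^10 bytes for the kernel matrix alone, which illustrates why a single
computer quickly runs out of memory.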

The first approach to overcoming large scale data set training is to reduce
the feature vector size. Feature selection and feature transformation are the
basic approaches for reducing vector size [3]. Feature selection algorithms
choose a subset of the features from the original feature set, while feature
transformation algorithms map the data from the original feature space to a
new space with reduced dimensionality. The literature offers several methods:
Singular Value Decomposition (SVD) [4], Principal Component Analysis (PCA) [5],
Independent Component Analysis (ICA) [6], Correlation Based Feature Selection
(CFS) [7], and sampling based data set selection. All of these methods share a
major problem: they can degrade the generalization of the final machine
learning model.

The second approach for large scale data set training is chunking [13].
Collobert et al. [12] propose a parallel SVM training algorithm in which each
subset of the whole dataset is trained with SVM and the classifiers are then
combined into a final single classifier. Lu et al. [8] proposed the
distributed support vector machine (DSVM) algorithm, which finds support
vectors (SVs) on strongly connected networks. Each site within a strongly
connected network classifies subsets of training data locally via SVM, passes
the calculated SVs to its descendant sites, receives SVs from its ancestor
sites, recalculates the SVs, passes them to its descendant sites, and so on.
Ruping et al. [3] proposed incremental learning with support vector machines,
in which an error on the old support vectors (which represent the old learning
set) is made more costly than an error on a new example. Syed et al. [10]
proposed the distributed support vector machine (DSVM) algorithm, which finds
SVs locally and processes them altogether in a central processing center.
Caragea et al. [11] improved this algorithm in 2005 by allowing the data
processing center to send support vectors back to the distributed data sources
and iteratively achieve the global optimum. Graf et al. [14] proposed an
algorithm that arranges distributed processors in a cascade top-down network
topology, namely Cascade SVM. The bottom node of the network is the central
processing center. The distributed SVM methods in these works converge and
increase test accuracy, but all of them share similar problems. They require a
pre-defined network topology and number of computers in their network, and
the training performance depends on the specific network configuration. The
main idea of current distributed SVM methods is first data chunking and then
parallel implementation of SVM training. Global synchronization overheads are
not considered in these approaches.

In this paper, we propose a cloud computing based SVM method that uses the
MapReduce [18] technique for the distributed training phase of the algorithm.
The training set is split over the data nodes of a cloud computing system, and
each subset is optimized iteratively to find a single global classifier. The
basic idea behind this approach is to collect the SVs from every optimized
subset of the training set at each cloud node and then merge them to save as
global support vectors. Computers in the cloud computing system exchange only
a minimum number of training set samples. Our algorithm, CloudSVM, is analysed
with various public datasets. CloudSVM is built on LibSVM and implemented
using the Hadoop implementation of MapReduce.

This paper is organized as follows. Section 2 provides an overview of SVM
formulations. Section 3 presents the MapReduce pattern in detail. Section 4
explains the system model with our implementation of the MapReduce pattern
for SVM training. Section 5 explains the convergence of CloudSVM. Section 6
shows simulation results with various UCI datasets. Section 7 gives
concluding remarks.

2 Support Vector Machine 

A support vector machine is a supervised learning method from statistics and
computer science that analyses data and recognizes patterns; it is used for
classification and regression analysis. The standard SVM takes a set of input
data and predicts, for each given input, which of two possible classes the
input belongs to, making the SVM a non-probabilistic binary linear classifier.
If the training data are linearly separable, as shown in Figure 1, we can
select the two hyperplanes of the margin so that there are no points between
them and then try to maximize their distance. By geometry, the distance
between these two hyperplanes is $2/\|w\|$. The training data $\mathcal{D}$ is
a set of n points of the form

$$\mathcal{D} = \{(x_i, y_i) \mid x_i \in \mathbb{R}^m,\ y_i \in \{-1, 1\}\}_{i=1}^{n} \qquad (1)$$

where $x_i$ is an m-dimensional real vector and $y_i$ is either -1 or 1,
denoting the class to which point $x_i$ belongs. SVMs search for a hyperplane
in the Reproducing Kernel Hilbert Space (RKHS) that maximizes the margin
between the two classes of data in $\mathcal{D}$ with the smallest training
error [13]. This problem can be formulated as the following quadratic
optimization problem:

$$\begin{aligned}
\text{minimize:} \quad & P(w, b, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{subject to:} \quad & y_i(\langle w, \phi(x_i) \rangle + b) \ge 1 - \xi_i \\
& \xi_i \ge 0
\end{aligned} \qquad (2)$$

for $i = 1, \dots, m$, where $\xi_i$ are slack variables and C is a constant
denoting the cost of each slack. C is a trade-off parameter that balances
maximizing the margin against minimizing the training error.

Figure 1: Binary classification of an SVM with Maximum-margin hyperplane
trained with samples from two classes. Samples on the margin are called the
support vectors.

The decision function of SVMs is $f(x) = w^T \phi(x) + b$, where w and b are
obtained by solving the optimization problem P in (2). Using Lagrange
multipliers, the optimization problem P in (2) can be expressed as

$$\begin{aligned}
\text{min:} \quad & F(\alpha) = \frac{1}{2}\alpha^T Q \alpha - \alpha^T \mathbf{1} \\
\text{subject to:} \quad & 0 \le \alpha \le C \\
& y^T \alpha = 0
\end{aligned} \qquad (3)$$

where $[Q]_{ij} = y_i y_j \phi^T(x_i)\phi(x_j)$ and $\alpha$ is the vector of
Lagrange multipliers. It is not necessary to know $\phi$; it is only necessary
to know how to compute the modified inner product, called the kernel function,
$K(x_i, x_j) = \phi^T(x_i)\phi(x_j)$. Thus, $[Q]_{ij} = y_i y_j K(x_i, x_j)$.
If K is a positive definite kernel then, by Mercer's theorem, the optimization
problem P is a convex quadratic programming (QP) problem with linear
constraints and can be solved in polynomial time.
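
As a minimal sketch of the dual formulation, the code below builds the matrix
Q for a linear kernel and evaluates the dual objective and its constraints on
a tiny example. The function names, the linear kernel choice, and the two-point
toy data are assumptions made for illustration; an actual solver such as LibSVM
searches for the minimizing multipliers rather than being handed them.

```python
def linear_kernel(u, v):
    """K(u, v) = <u, v>, i.e. phi is the identity map."""
    return sum(a * b for a, b in zip(u, v))


def build_q(xs, ys, kernel=linear_kernel):
    """[Q]_ij = y_i * y_j * K(x_i, x_j)."""
    n = len(xs)
    return [[ys[i] * ys[j] * kernel(xs[i], xs[j]) for j in range(n)]
            for i in range(n)]


def dual_objective(alpha, q):
    """F(alpha) = 1/2 * alpha^T Q alpha - alpha^T 1."""
    n = len(alpha)
    quad = sum(alpha[i] * q[i][j] * alpha[j]
               for i in range(n) for j in range(n))
    return 0.5 * quad - sum(alpha)


def is_feasible(alpha, ys, c, tol=1e-9):
    """Check the box constraint 0 <= alpha_i <= C and y^T alpha = 0."""
    in_box = all(-tol <= a <= c + tol for a in alpha)
    balanced = abs(sum(a * y for a, y in zip(alpha, ys))) <= tol
    return in_box and balanced


# Hypothetical toy data: one point per class on the x-axis.
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
q = build_q(xs, ys)
alpha = [0.5, 0.5]  # multipliers giving w = (1, 0) for this toy problem
print(dual_objective(alpha, q), is_feasible(alpha, ys, c=1.0))
```

For this toy problem the multipliers correspond to the hyperplane
$w = (1, 0)$, $b = 0$, whose margin $2/\|w\| = 2$ is exactly the distance
between the two points.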

3 MapReduce 

MapReduce is a programming model derived from the map and reduce function
combination in functional programming. The MapReduce model is widely used to
run parallel applications for large scale data set processing. Users specify a
map function that processes a key/value pair to generate a set of intermediate
key/value pairs, and a reduce function that merges all intermediate values
associated with the same intermediate key [18]. MapReduce is divided into two
major phases called map and reduce, separated by an internal shuffle phase
of the intermediate results. The framework automatically executes those
functions in parallel over any number of processors [15]. Simply put, a
MapReduce job executes three basic operations on a data set distributed across
many shared-nothing cluster nodes. The first is the Map function, which each
node executes in parallel without transferring any data to other nodes. In the
next operation, the data processed by the Map function is repartitioned across
all nodes of the cluster. Lastly, the Reduce task is executed in parallel by
each node on its partition of the data.

Figure 2: Overview of MapReduce System



A file in the distributed file system (DFS) is split into multiple chunks and
each chunk is stored on a different data node. A map function takes a
key/value pair as input from the input chunks and produces a list of key/value
pairs as output. The types of the output key and value can differ from those
of the input key and value:

map(key1, value1) => list(key2, value2)

A reduce function takes a key and its associated value list as input and
generates a list of new values as output:

reduce(key2, list(value2)) => list(value3)

Each Reduce call typically produces either one value v3 or an empty return,
though one call is allowed to return more than one value. The returns of all
calls are collected as the desired result list. The main advantage of a
MapReduce system is that it allows distributed processing of a submitted job
on subsets of the whole dataset across the network.
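
The map, shuffle, and reduce phases described above can be illustrated with a
small in-memory simulation. This is only a sketch: the word-count task, the
function names, and the single-process execution are assumptions chosen for
illustration, and a real framework such as Hadoop runs these phases across
many nodes.

```python
from collections import defaultdict


def map_fn(key, value):
    """map(key1, value1) => list(key2, value2): emit (word, 1) per word."""
    return [(word, 1) for word in value.split()]


def reduce_fn(key, values):
    """reduce(key2, list(value2)) => list(value3): sum the counts."""
    return [sum(values)]


def run_mapreduce(chunks, map_fn, reduce_fn):
    """Simulate the map, shuffle and reduce phases over in-memory chunks."""
    # Map phase: each chunk is processed independently of the others.
    intermediate = []
    for key, value in chunks:
        intermediate.extend(map_fn(key, value))
    # Shuffle phase: group intermediate values by their key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce call per intermediate key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical input: two chunks keyed by chunk number.
chunks = [(0, "map reduce map"), (1, "reduce shuffle")]
result = run_mapreduce(chunks, map_fn, reduce_fn)
print(result)
```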
4 System Model




Figure 3: Schematic of Cloud SVM architecture. 



CloudSVM is a MapReduce based SVM training algorithm that runs in parallel
on multiple commodity computers with Hadoop. As shown in Figure 3, the
training set of the algorithm is split into subsets, and each one is evaluated
individually to get $\alpha$ values (i.e. support vectors). In the Map stage
of the MapReduce job, each subset of the training set is combined with the
global support vectors. In the Reduce step, the merged subset of training data
is evaluated, and the resulting new support vectors are combined with the
global support vectors. The CloudSVM with MapReduce algorithm can be explained
as follows. First, each computer within the cloud computing system reads the
global support vectors, then merges the global SVs with its subset of local
training data and classifies via SVM. Finally, all the computed SVs in the
cloud computers are merged. Thus, the algorithm updates the global SVs with
the new ones. The CloudSVM algorithm consists of the following steps.

1. Initialize: set t = 0 and the global support vector set
$SV_{Global}^{t} = \emptyset$.

2. t = t + 1;

3. Each computer l, l = 1, ..., L, reads the global SVs and merges them with
its subset of training data.

4. Train the SVM algorithm with the merged new data set.

5. Find the support vectors.

6. After all computers in the cloud system complete their training phase, merge
all calculated SVs and save the result to the global SVs.

7. If $h^{t} = h^{t-1}$, stop; otherwise go to step 2.

Pseudo code of the Map and Reduce functions of the CloudSVM algorithm is given
in Algorithm 1 and Algorithm 2.

Algorithm 1 Map Function of CloudSVM Algorithm
$SV_{Global} = \emptyset$ // Empty global support vector set
while $h^{t} \ne h^{t-1}$ do
    for $l \in L$ do // For each subset loop
        $\mathcal{D}_l^{t} \leftarrow \mathcal{D}_l^{t} \cup SV_{Global}^{t}$
    end for
end while



Algorithm 2 Reduce Function of CloudSVM Algorithm
while $h^{t} \ne h^{t-1}$ do
    for $l \in L$ do
        $SV_l, h^{t} \leftarrow svm(\mathcal{D}_l)$ // Train merged dataset to obtain support vectors and hypothesis
    end for
    for $l \in L$ do
        $SV_{Global} \leftarrow SV_{Global} \cup SV_l$
    end for
end while



For training SVM classifier functions, we used LibSVM with various kernels.
Appropriate values of the parameters C and $\gamma$ were found by
cross-validation. The whole system is implemented with Hadoop and the Python
streaming library mrjob.
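
The iteration of steps 1 to 7 can be sketched in a single process as follows.
This is not the Hadoop/mrjob implementation: the trainer `train_svm_1d` is a
hypothetical stand-in for LibSVM that solves the hard-margin SVM exactly for
separable one-dimensional data, the hypothesis compared in the stopping test is
the classifier trained on the global SVs, and all names and data are
assumptions made for illustration.

```python
def train_svm_1d(points):
    """Toy exact hard-margin SVM for separable 1-D data (stand-in for LibSVM).

    points: iterable of (x, y) with y in {-1, +1}; assumes every negative x
    lies below every positive x. Returns (support_vectors, (w, b)).
    """
    neg = max(x for x, y in points if y == -1)  # closest negative point
    pos = min(x for x, y in points if y == +1)  # closest positive point
    w = 2.0 / (pos - neg)
    b = -(pos + neg) / (pos - neg)
    return {(neg, -1), (pos, +1)}, (w, b)


def cloud_svm(subsets, max_iters=20):
    """Iterate the Map (merge) and Reduce (train, collect SVs) steps."""
    global_svs = set()   # step 1: empty global support vector set
    hypothesis = None
    for t in range(1, max_iters + 1):           # step 2
        # Map: every node merges its subset with the global SVs (step 3).
        merged = [set(subset) | global_svs for subset in subsets]
        # Reduce: train on each merged set and collect its SVs (steps 4-6).
        for data in merged:
            svs, _ = train_svm_1d(data)
            global_svs = global_svs | svs
        _, new_hypothesis = train_svm_1d(global_svs)
        if new_hypothesis == hypothesis:         # step 7: h^t equals h^(t-1)
            return global_svs, hypothesis, t
        hypothesis = new_hypothesis
    return global_svs, hypothesis, max_iters


# Hypothetical training data, already split over three "nodes".
subsets = [
    [(-5.0, -1), (-3.0, -1), (4.0, 1), (6.0, 1)],
    [(-4.0, -1), (-1.0, -1), (2.0, 1), (7.0, 1)],
    [(-6.0, -1), (-2.0, -1), (3.0, 1), (5.0, 1)],
]
global_svs, (w, b), iterations = cloud_svm(subsets)
print(sorted(global_svs), w, b, iterations)
```

In this toy setting the global SV set stops changing after the second pass,
because the closest pair of opposite-class points is already contained in the
SVs collected during the first pass.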
5 Convergence of CloudSVM



Let S denote a subset of the training set $\mathcal{D}$, let F(S) be the
optimal objective function over data set S, and let $h^{*}$ be the global
optimal hypothesis, which has minimal empirical risk $R_{emp}(h)$. Our
algorithm starts with $SV_{Global}^{0} = \emptyset$ and generates a
non-increasing sequence of positive sets of vectors $SV_{Global}^{t}$, where
$SV_{Global}^{t}$ is the vector of support vectors at the t-th iteration. We
used hinge loss for testing our models trained with the CloudSVM algorithm.
Hinge loss works well for its purposes in SVM as a classifier, since the more
the margin is violated, the higher the penalty is [20]. The hinge loss
function is the following:

$$\ell(f(x), y) = \max\{0,\ 1 - y \cdot f(x)\}$$

Empirical risk can be computed with an approximation:

$$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$

According to the empirical risk minimization principle, the learning algorithm
should choose a hypothesis h which minimizes the empirical risk:

$$h^{*} = \arg\min_{h} R_{emp}(h).$$
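
A minimal sketch of these two quantities follows. The linear hypothesis
`h` and the four samples are hypothetical values chosen so that the arithmetic
is easy to follow by hand.

```python
def hinge_loss(score, y):
    """Hinge loss max(0, 1 - y * f(x)) for a decision value f(x) = score."""
    return max(0.0, 1.0 - y * score)


def empirical_risk(h, samples):
    """Average hinge loss of hypothesis h over (x, y) samples."""
    return sum(hinge_loss(h(x), y) for x, y in samples) / len(samples)


# Hypothetical 1-D linear hypothesis and toy samples.
def h(x):
    return 0.5 * x


samples = [(4.0, 1), (1.0, 1), (-1.0, -1), (1.0, -1)]
risk = empirical_risk(h, samples)
print(risk)
```

The individual losses here are 0, 0.5, 0.5 and 1.5: a sample beyond the margin
costs nothing, samples inside the margin cost a little, and the misclassified
sample costs the most.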



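A hypothesis is found in every cloud node. Let X be a subset of training data
at cloud node i, where $X \in \mathbb{R}^{m \times n}$, let $SV_{Global}^{t}$
be the vector of support vectors at the t-th iteration, and let $h^{t,i}$ be
the hypothesis at node i at iteration t. Then the optimization problem in
equation (3) becomes

$$\text{maximize} \quad h^{t,i} = -\frac{1}{2}
\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}^{T}
\begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix}
\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}
+ \begin{bmatrix} 1 \\ 1 \end{bmatrix}^{T}
\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} \qquad (4)$$

$$\text{subject to:} \quad 0 \le \alpha_i \le C,\ \forall i \quad \text{and} \quad \sum_{i} \alpha_i y_i = 0$$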



where $Q_{12}$ and $Q_{21}$ are kernel matrices with respect to

$$Q_{12} = \{K_{i,j}(x_{i,j}, SV_{Global}^{t}(i,j)) \mid i = 1, \dots, m,\ j = 1, \dots, n\}.$$

$\alpha_1$ and $\alpha_2$ are the solutions estimated by node i with dataset X
and $SV_{Global}$. By Mercer's theorem, our kernel matrix Q is symmetric and
positive definite. The off-diagonal sub-matrices must therefore be transposes
of each other, $Q_{21} = Q_{12}^{T}$.

We can define the $Q_{11}$ and $Q_{22}$ matrices such that

$$Q_{11} = \{K_{i,j}(x_{i,j}, x_{i,j}) \mid x_{i,j} \in X,\ i = 1, \dots, m,\ j = 1, \dots, n\}$$
$$Q_{22} = \{K_{i,j}(SV_{Global}^{t}, SV_{Global}^{t}) \mid i = 1, \dots, m,\ j = 1, \dots, n\}$$

at iteration t.
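
The block structure of Q can be made concrete with a short sketch that
assembles the four sub-matrices from a node's local data and the global
support vectors. The linear kernel, the function names, and the sample points
are assumptions for illustration only, and the label factors $y_i y_j$ are
omitted so that the blocks are plain kernel matrices.

```python
def linear_kernel(u, v):
    """K(u, v) = <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def kernel_block(rows, cols, kernel=linear_kernel):
    """Kernel values K(r, c) for every r in rows and every c in cols."""
    return [[kernel(r, c) for c in cols] for r in rows]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# Hypothetical node data X and global support vectors.
x_local = [(1.0, 2.0), (0.0, 1.0), (3.0, 1.0)]
sv_global = [(2.0, 0.0), (1.0, 1.0)]

q11 = kernel_block(x_local, x_local)      # local data vs. local data
q12 = kernel_block(x_local, sv_global)    # local data vs. global SVs
q21 = kernel_block(sv_global, x_local)    # global SVs vs. local data
q22 = kernel_block(sv_global, sv_global)  # global SVs vs. global SVs

# Assemble Q = [[Q11, Q12], [Q21, Q22]] row by row.
q = ([r1 + r2 for r1, r2 in zip(q11, q12)]
     + [r1 + r2 for r1, r2 in zip(q21, q22)])
print(q21 == transpose(q12), q == transpose(q))
```

Because the kernel is symmetric, the off-diagonal blocks are transposes of each
other and the assembled matrix Q is symmetric.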

The algorithm's stopping point is reached when the empirical risk of the
hypothesis is the same as in the previous iteration. That is:

$$R_{emp}(h^{t}) = R_{emp}(h^{t-1}) \qquad (5)$$

Lemma: The accuracy of the decision function of the CloudSVM classifier at
iteration t is always greater than or equal to the maximum accuracy of the
decision function of the SVM classifier at iteration t - 1. That is

$$R_{emp}(h^{t}) \le \min_{i} R_{emp}(h^{t-1,i}) \qquad (6)$$

Proof: Without loss of generality, iterated CloudSVM monotonically converges
to the optimum classifier.

$$SV_{Global}^{t} = SV_{Global}^{t-1} \cup \{SV_i^{t-1} \mid i = 1, \dots, n\}$$

where n is the data set split size (or cloud node size). Then, the training
set for the SVM algorithm at node i is

$$d = X \cup SV_{Global}^{t}$$

Adding more samples cannot decrease the optimal value. The accuracy of the
sub-problem in each node therefore increases monotonically at each step.

6 Simulation Results 



Table 1: The datasets used in experiments

Dataset Name    Train. Data    Dim.
German          1000           24
Heart           270            13
Ionosphere      351            34
Satellite       4435           36

We have selected several data sets from the UCI Machine Learning Repository,
namely, German, Heart, Ionosphere, Hand Digit and Satellite. The data set
sizes and input dimensions are shown in Table 1. We test our algorithm on
real-world data sets to demonstrate the convergence. Linear kernels were
used with optimal parameters ($\gamma$, C). Parameters were estimated by
cross-validation.

We used 10-fold cross-validation, dividing the set of samples at random into
10 approximately equal-size parts. The 10 parts were roughly balanced,
ensuring that the classes were distributed uniformly to each of the 10 parts.
Ten-fold cross-validation works as follows: we fit the model on 90% of the
samples and then predict the class labels of the remaining 10% (the test
samples). This procedure is repeated 10 times, with each part playing the role
of the test samples, and the errors on all 10 parts are added together to
compute the overall error.

To analyse CloudSVM, we randomly distributed all the training data to a
cloud computing system of 10 computers running pseudo-distributed Hadoop.
Data set prediction accuracy with iterations and total number of SVs are shown
in Table 3. When the number of iterations reaches 3-5, the test accuracy
values of all data sets reach their highest values. If the number of
iterations is increased further, the test accuracy settles into a steady
state: it does not change for a large enough number of iterations.

Table 2: Performance Results of CloudSVM algorithm with various UCI datasets

Dataset Name    γ       C    No. of Iterations    No. of SVs    Accuracy    Kernel Type
German          10 u    1    5                    606           0.7728      Linear
Heart           10^0    1    3                    137           0.8259      Linear
Ionosphere      10^8    1    3                    160           0.8423      Linear
Satellite       10^0    1    2                    1384          0.9064      Linear
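
The 10-fold cross-validation procedure described above can be sketched as
follows. The fold construction, the function names, and the threshold
classifier used in the usage example are assumptions for illustration; they
stand in for the LibSVM training and prediction calls used in the experiments.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with roughly balanced classes."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a share of every class.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validate(samples, labels, train_fn, predict_fn, k=10, seed=0):
    """Overall error: misclassified test samples summed over all k folds."""
    folds = stratified_folds(labels, k, seed)
    errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        errors += sum(predict_fn(model, samples[i]) != labels[i]
                      for i in test_idx)
    return errors / len(samples)


# Hypothetical stand-in classifier: threshold halfway between class means.
def train_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict_threshold(threshold, x):
    return 1 if x > threshold else -1


samples = [-(5.0 + i) for i in range(20)] + [5.0 + i for i in range(20)]
labels = [-1] * 20 + [1] * 20
folds = stratified_folds(labels)
error = cross_validate(samples, labels, train_threshold, predict_threshold)
print(error)
```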

As the number of iterations increases, the number of global support vectors
also reaches a steady state. As a result, the CloudSVM algorithm is useful for
large training data sets.

7 Conclusion and Further Research 

We have proposed a distributed support vector machine implementation for cloud
computing systems with the MapReduce technique that improves the scalability
and parallelism of split data set training. The performance and generalization
properties of our algorithm are evaluated in Hadoop. Our algorithm can work on
cloud computing systems without knowing how many computers are connected to
run in parallel. The algorithm is designed to deal with large scale data set
training problems. It is empirically shown that the generalization performance
and the risk minimization of our algorithm are better than the previous
results.